CKME 136 - Capstone Project [Farrukh Aziz]

This project aims to cluster restaurants in 10 metropolitan cities of North America into contiguous groups of geospatial locations. Insight into the interests of customers who review the restaurants of a particular cluster is then used to indicate supply and demand proportions for various categories of restaurants.

Index

Please click links to jump to the specific area:

Import Libraries
Attribute Analysis
Attribute Selection
Data Clustering
User Data Preparation
Mapping Data Preparation
Data Visualization

Please click links below to access interactive diagrams and maps:
DBSCAN Min. Neighbors & Distance vs Coverage
DBSCAN Min. Neighbors & Distance vs Cluster Count
DBSCAN Min. Neighbors & Distance vs Largest Cluster Size
DBSCAN Label Count Histogram for Min. Neighbors & Distance

North America Clustered Restaurants by Location (All Categories)
All Clustered Restaurants on Sketch (Toronto)
All Clustered Restaurants on Map (Toronto)
Slider Controlled Categories Displaying Demand (Toronto)



Import Libraries

Import required libraries

For some of the third-party libraries, you may have to run 'pip install' commands, e.g.

  • pip install geopy
  • pip install shapely
  • pip install matplotlib
  • pip install plotly
  • pip install cufflinks
In [798]:
#pk.eyJ1IjoiZjhheml6IiwiYSI6ImNqb3plOWp6MjA0bXIzcnFxczZ1bjdrbmwifQ.5qd5W4B06UUZc20Jax12OA
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time, plotly.plotly as py, plotly.graph_objs as go
import multiprocessing as mp
from IPython.core.display import display, HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
from plotly import tools
from collections import Counter
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline

init_notebook_mode(connected=True)


Read Data

Read data from the files

The following files from Yelp Dataset will be used:

  1. yelp_academic_dataset_business.json : Business coordinates, locations, categories, etc.
  2. yelp_academic_dataset_review.json : User reviews, related business ids etc.

The rest of the data in the dataset is not useful for the purposes of this project.

In [394]:
business_file = "yelp_academic_dataset_business.json"
review_file = "yelp_academic_dataset_review.json"

start_time = time.time()

df_business_data_full = pd.read_json(business_file, lines=True)
df_review_data_full = pd.read_json(review_file, lines=True)

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Time taken: 78.19 seconds
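Loading the full review file this way takes over a minute and several gigabytes of RAM. As an alternative sketch (not used in this notebook), pandas can read JSON-lines files in chunks via the `chunksize` parameter, trimming each chunk to the needed columns before keeping it; the helper name below is hypothetical.

```python
import pandas as pd

def read_json_lines_chunked(path, chunksize=100_000, usecols=None):
    """Read a JSON-lines file in fixed-size chunks and concatenate them.

    Trimming each chunk to the needed columns (usecols) before keeping it
    caps peak memory, since unwanted columns are dropped chunk by chunk.
    """
    # lines=True + chunksize makes read_json return an iterator of DataFrames
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    pieces = [chunk[usecols] if usecols is not None else chunk
              for chunk in reader]
    return pd.concat(pieces, ignore_index=True)
```

For the review file this would be called as `read_json_lines_chunked(review_file, usecols=['business_id', 'review_id', 'user_id'])`, combining the read and the attribute selection done later.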


Attribute Analysis

Reveal data types and sample data rows from the dataset

1. Business data types and attributes

In [395]:
df_business_data_full.info()
df_business_data_full.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188593 entries, 0 to 188592
Data columns (total 15 columns):
address         188593 non-null object
attributes      162807 non-null object
business_id     188593 non-null object
categories      188052 non-null object
city            188593 non-null object
hours           143791 non-null object
is_open         188593 non-null int64
latitude        188587 non-null float64
longitude       188587 non-null float64
name            188593 non-null object
neighborhood    188593 non-null object
postal_code     188593 non-null object
review_count    188593 non-null int64
stars           188593 non-null float64
state           188593 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 21.6+ MB
Out[395]:
address attributes business_id categories city hours is_open latitude longitude name neighborhood postal_code review_count stars state
0 1314 44 Avenue NE {'BikeParking': 'False', 'BusinessAcceptsCredi... Apn5Q_b6Nz61Tq4XzPdf9A Tours, Breweries, Pizza, Restaurants, Food, Ho... Calgary {'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'... 1 51.091813 -114.031675 Minhas Micro Brewery T2E 6L6 24 4.0 AB
1 {'Alcohol': 'none', 'BikeParking': 'False', 'B... AjEbIBw6ZFfln7ePHha9PA Chicken Wings, Burgers, Caterers, Street Vendo... Henderson {'Friday': '17:0-23:0', 'Saturday': '17:0-23:0... 0 35.960734 -114.939821 CK'S BBQ & Catering 89002 3 4.5 NV
2 1335 rue Beaubien E {'Alcohol': 'beer_and_wine', 'Ambience': '{'ro... O8S5hYJ1SMc8fA4QBtVujA Breakfast & Brunch, Restaurants, French, Sandw... Montréal {'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'... 0 45.540503 -73.599300 La Bastringue Rosemont-La Petite-Patrie H2G 1K7 5 4.0 QC

2. Review data types and attributes

In [396]:
df_review_data_full.info()
df_review_data_full.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5996996 entries, 0 to 5996995
Data columns (total 9 columns):
business_id    object
cool           int64
date           datetime64[ns]
funny          int64
review_id      object
stars          int64
text           object
useful         int64
user_id        object
dtypes: datetime64[ns](1), int64(4), object(4)
memory usage: 411.8+ MB
Out[396]:
business_id cool date funny review_id stars text useful user_id
0 iCQpiavjjPzJ5_3gPD5Ebg 0 2011-02-25 0 x7mDIiDB3jEiPGPHOmDzyw 2 The pizza was okay. Not the best I've had. I p... 0 msQe1u7Z_XuqjGoqhB0J5g
1 pomGBqfbxcqPv14c3XH-ZQ 0 2012-11-13 0 dDl8zu1vWPdKGihJrwQbpw 5 I love this place! My fiance And I go here atl... 0 msQe1u7Z_XuqjGoqhB0J5g
2 jtQARsP6P-LbkyjbO1qNGg 1 2014-10-23 1 LZp4UX5zK3e-c5ZGSeo3kA 1 Terrible. Dry corn bread. Rib tips were all fa... 3 msQe1u7Z_XuqjGoqhB0J5g


Attribute Selection

Reduce the attribute list to only the useful information for this project.

  • From business data, [ business_id, latitude, longitude, city, state, postal_code and categories ]
  • From review data, [ business_id, review_id, user_id ]
In [397]:
start_time = time.time()

business_cols = ['business_id', 'latitude', 'longitude', 'city', 'neighborhood', \
                 'state', 'postal_code', 'stars', 'categories']
review_cols = ['business_id', 'review_id', 'user_id']

df_business_data = df_business_data_full.filter(business_cols , axis=1)
df_review_data = df_review_data_full.filter(review_cols , axis=1)

df_review_data.to_pickle('df_review_data.pkl')

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))

df_business_data.head()
Time taken: 8.39 seconds
Out[397]:
business_id latitude longitude city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 Calgary AB T2E 6L6 4.0 Tours, Breweries, Pizza, Restaurants, Food, Ho...
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 Henderson NV 89002 4.5 Chicken Wings, Burgers, Caterers, Street Vendo...
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 Breakfast & Brunch, Restaurants, French, Sandw...
3 bFzdJJ3wp3PZssNEsyU23g 33.449999 -112.076979 Phoenix AZ 85003 1.5 Insurance, Financial Services
4 8USyCYqpScwiNEb58Bt6CA 51.035591 -114.027366 Calgary AB T2H 0N5 2.0 Home & Garden, Nurseries & Gardening, Shopping...

Prepare Business Data

For the scope of this project, filter down to the businesses located within the US/Canada that have been categorized. This eliminates noise and the small number of restaurants that have not been categorized. Limiting the data to the US/Canada helps fit it within North American map coordinates while retaining the majority of the data.

Note: Business categories could be inferred from user reviews; however, that is outside the scope of this project.

In [404]:
start_time = time.time()

print(df_business_data.shape)

north_american_state_provinces = ['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', \
                                  'DE', 'FL', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', \
                                  'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', \
                                  'MP', 'MS', 'MT', 'NA', 'NC', 'ND', 'NE', 'NH', 'NJ', \
                                  'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', \
                                  'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', \
                                  'WI', 'WV', 'WY','AB', 'BC', 'MB', 'NB', 'NL', 'NT', \
                                  'NS', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT']

for idx, row in df_business_data.iterrows():
    if row['state'] not in north_american_state_provinces:
        df_business_data.drop(idx, inplace=True)

df_business_data = df_business_data[df_business_data['categories'].notnull()]


df_business_data.to_pickle('df_business_data.pkl')

print(df_business_data.shape)

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
df_business_data.head()
(188593, 9)
(187464, 9)
Time taken: 44.43 seconds
Out[404]:
business_id latitude longitude city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 Calgary AB T2E 6L6 4.0 Tours, Breweries, Pizza, Restaurants, Food, Ho...
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 Henderson NV 89002 4.5 Chicken Wings, Burgers, Caterers, Street Vendo...
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 Breakfast & Brunch, Restaurants, French, Sandw...
3 bFzdJJ3wp3PZssNEsyU23g 33.449999 -112.076979 Phoenix AZ 85003 1.5 Insurance, Financial Services
4 8USyCYqpScwiNEb58Bt6CA 51.035591 -114.027366 Calgary AB T2H 0N5 2.0 Home & Garden, Nurseries & Gardening, Shopping...
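For reference, the row-by-row iterrows/drop loop above can be replaced with a single vectorized filter via `Series.isin`, which typically runs in a small fraction of the time. A sketch on toy data (the column names match the notebook; the rows are made up):

```python
import pandas as pd

keep_states = ['ON', 'QC', 'AZ', 'NV']  # stand-in for the full state/province list

df_toy = pd.DataFrame({
    'business_id': ['a', 'b', 'c', 'd'],
    'state':       ['ON', 'XX', 'AZ', 'ZZ'],   # XX/ZZ: outside North America
    'categories':  ['Pizza', None, 'Thai', 'Sushi'],
})

# one boolean mask instead of dropping rows one at a time
mask = df_toy['state'].isin(keep_states) & df_toy['categories'].notnull()
df_filtered = df_toy[mask].reset_index(drop=True)
print(df_filtered['business_id'].tolist())  # → ['a', 'c']
```

On the real DataFrame the equivalent would be `df_business_data[df_business_data['state'].isin(north_american_state_provinces)]`, combined with the `notnull()` filter already used above.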

Filter down to the list of businesses that are categorized as Restaurants

  • Clean up and build a list of all categories
  • Filter down to rows containing Restaurants as category
  • Display number of rows before and after
In [405]:
start_time = time.time()

# create copy so that original business_data is intact
df_business_categorized_data = df_business_data.copy()

df_business_categorized_data['categories'] = df_business_data['categories'] \
.map(lambda x : (list(map(str.strip, x.split(',')))))

print('Total data rows and columns:{}'.format(df_business_categorized_data.shape))

df_restaurants = df_business_categorized_data[df_business_categorized_data['categories'] \
                                              .map(lambda x : 'Restaurants' in x)]

print('Restaurant data rows and columns:{}'.format(df_restaurants.shape))

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Total data rows and columns:(187464, 9)
Restaurant data rows and columns:(56839, 9)
Time taken: 0.86 seconds

Since we have dropped all rows that don't have Restaurants as a category, the DataFrame must be re-indexed to fill the gaps.

In [406]:
df_restaurants = df_restaurants.reset_index(drop=True)
df_restaurants.head()
Out[406]:
business_id latitude longitude city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 Calgary AB T2E 6L6 4.0 [Tours, Breweries, Pizza, Restaurants, Food, H...
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 Henderson NV 89002 4.5 [Chicken Wings, Burgers, Caterers, Street Vend...
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Breakfast & Brunch, Restaurants, French, Sand...
3 6OuOZAok8ikONMS_T3EzXg 43.712946 -79.632763 Mississauga Ridgewood ON L4T 1A8 2.0 [Restaurants, Thai]
4 8-NRKkPY1UiFXW20WXKiXg 33.448106 -112.341302 Avondale AZ 85323 2.5 [Mexican, Restaurants]

Display the list of unique states. We will use one of these states to cluster at the state level and make sense of the clustered data.

In [407]:
# list of states included in the dataset
df_restaurants.state.unique()
Out[407]:
array(['AB', 'NV', 'QC', 'ON', 'AZ', 'OH', 'IL', 'WI', 'PA', 'NC', 'SC',
       'IN', 'CO', 'VA', 'NY', 'OR', 'CA', 'MO', 'FL', 'BC'], dtype=object)

Top 20 Categories: Combine all category lists and count the 50 most common. The top 20 categories that represent actual food categories will be used for the analysis.

In [408]:
start_time = time.time()

all_categories = df_restaurants['categories'].sum()

ct = Counter(all_categories)

top_50_categories = [x[0] for x in list(ct.most_common(50))]

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))

print(top_50_categories)
Time taken: 86.24 seconds
['Restaurants', 'Food', 'Nightlife', 'Bars', 'Sandwiches', 'Fast Food', 'American (Traditional)', 'Pizza', 'Burgers', 'Breakfast & Brunch', 'Italian', 'Mexican', 'Chinese', 'American (New)', 'Coffee & Tea', 'Cafes', 'Japanese', 'Chicken Wings', 'Seafood', 'Salad', 'Event Planning & Services', 'Sushi Bars', 'Delis', 'Canadian (New)', 'Asian Fusion', 'Mediterranean', 'Barbeque', 'Sports Bars', 'Specialty Food', 'Caterers', 'Steakhouses', 'Desserts', 'Bakeries', 'Indian', 'Thai', 'Pubs', 'Diners', 'Vietnamese', 'Middle Eastern', 'Vegetarian', 'Greek', 'French', 'Wine Bars', 'Cocktail Bars', 'Korean', 'Ice Cream & Frozen Yogurt', 'Beer', 'Wine & Spirits', 'Buffets', 'Arts & Entertainment']
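Most of the 86 seconds above are spent in `.sum()`, which concatenates the per-row lists quadratically. Feeding a flat generator straight into `Counter` produces the same counts in linear time; a sketch on toy data:

```python
from collections import Counter
import pandas as pd

categories = pd.Series([
    ['Restaurants', 'Pizza'],
    ['Restaurants', 'Thai'],
    ['Pizza'],
])

# flatten lazily: no intermediate concatenated list is ever built
ct = Counter(cat for row in categories for cat in row)
print(ct.most_common(2))  # → [('Restaurants', 2), ('Pizza', 2)]
```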

Since we are calculating demand for specific categories of food, and to limit the scope of this project, we choose the top 20 specific food categories that the businesses belong to:

  1. Sandwiches
  2. American (Traditional)
  3. Pizza
  4. Burgers
  5. Italian
  6. Mexican
  7. Chinese
  8. American (New)
  9. Japanese
  10. Chicken Wings
  11. Seafood
  12. Sushi Bars
  13. Canadian (New)
  14. Asian Fusion
  15. Mediterranean
  16. Steakhouses
  17. Indian
  18. Thai
  19. Vietnamese
  20. Middle Eastern
In [409]:
top_20_specific_categories = ['Sandwiches', 'American (Traditional)', 'Pizza', \
                              'Burgers', 'Italian', 'Mexican', 'Chinese', \
                              'American (New)', 'Japanese', 'Chicken Wings', \
                              'Seafood', 'Sushi Bars', 'Canadian (New)', \
                              'Asian Fusion', 'Mediterranean', 'Steakhouses', \
                              'Indian', 'Thai', 'Vietnamese', 'Middle Eastern']
len(top_20_specific_categories)
Out[409]:
20

Category Reduction: Reduce the categories of each business to only the top 20 selected above. All other categories are removed as an optimization, since they are not useful for the purpose of this analysis.

In [410]:
for idx, row in df_restaurants.iterrows():
    categories = row['categories']
    new_categories = list(set(categories) & set(top_20_specific_categories))
    df_restaurants.at[idx, 'categories'] = new_categories
df_restaurants.head()
Out[410]:
business_id latitude longitude city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 Calgary AB T2E 6L6 4.0 [Pizza]
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 Henderson NV 89002 4.5 [Burgers, Chicken Wings]
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches]
3 6OuOZAok8ikONMS_T3EzXg 43.712946 -79.632763 Mississauga Ridgewood ON L4T 1A8 2.0 [Thai]
4 8-NRKkPY1UiFXW20WXKiXg 33.448106 -112.341302 Avondale AZ 85323 2.5 [Mexican]

Getting Dummies: Create one column per category within the DataFrame, with value 1 if that category applies to the business and 0 otherwise. This follows an approach similar to pandas' get_dummies, which is often used for this kind of encoding.

In [411]:
df_category_flags = pd.DataFrame(0, index=np.arange(len(df_restaurants)), \
                                 columns=top_20_specific_categories)

for index, row in df_restaurants.iterrows():
    for category in row['categories']:
        df_category_flags.at[index, category] = 1

pd.DataFrame(df_category_flags.sum(), columns=['Count'])
Out[411]:
Count
Sandwiches 6910
American (Traditional) 6649
Pizza 6578
Burgers 5114
Italian 4503
Mexican 4412
Chinese 4235
American (New) 4229
Japanese 2565
Chicken Wings 2537
Seafood 2356
Sushi Bars 2153
Canadian (New) 1828
Asian Fusion 1775
Mediterranean 1741
Steakhouses 1522
Indian 1409
Thai 1388
Vietnamese 1225
Middle Eastern 1182
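As an aside, the manual flag loop above can also be written with pandas built-ins: explode the category lists, run `get_dummies`, and collapse the exploded rows back to one row per restaurant. A sketch on toy data (the notebook keeps the explicit loop for clarity):

```python
import pandas as pd

df_toy = pd.DataFrame({'categories': [['Pizza'],
                                      ['Burgers', 'Chicken Wings'],
                                      ['Thai']]})

# one row per (restaurant, category) pair, then one 0/1 column per
# category, then collapse back to one row per restaurant via max()
flags = (pd.get_dummies(df_toy['categories'].explode())
           .groupby(level=0).max()
           .astype(int))

print(flags.loc[1, 'Chicken Wings'])  # → 1
```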

Replace the category list of each restaurant with a binary flag per category within the restaurants data

In [412]:
df_restaurants_flagged = df_restaurants.join(df_category_flags)
print(len(df_restaurants_flagged))
df_restaurants_flagged.head()
56839
Out[412]:
business_id latitude longitude city neighborhood state postal_code stars categories Sandwiches ... Seafood Sushi Bars Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 Calgary AB T2E 6L6 4.0 [Pizza] 0 ... 0 0 0 0 0 0 0 0 0 0
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 Henderson NV 89002 4.5 [Burgers, Chicken Wings] 0 ... 0 0 0 0 0 0 0 0 0 0
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches] 1 ... 0 0 0 0 0 0 0 0 0 0
3 6OuOZAok8ikONMS_T3EzXg 43.712946 -79.632763 Mississauga Ridgewood ON L4T 1A8 2.0 [Thai] 0 ... 0 0 0 0 0 0 0 1 0 0
4 8-NRKkPY1UiFXW20WXKiXg 33.448106 -112.341302 Avondale AZ 85323 2.5 [Mexican] 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns

Save the flagged restaurants so they can be loaded easily for later analysis

In [413]:
df_restaurants_flagged.to_pickle('df_restaurants_flagged.pkl')
In [414]:
df_supply_indicator_by_category = df_restaurants_flagged.filter(top_20_specific_categories).sum()
df_supply_indicator_by_category.to_frame('Supply (Restaurant Count)')
Out[414]:
Supply (Restaurant Count)
Sandwiches 6910
American (Traditional) 6649
Pizza 6578
Burgers 5114
Italian 4503
Mexican 4412
Chinese 4235
American (New) 4229
Japanese 2565
Chicken Wings 2537
Seafood 2356
Sushi Bars 2153
Canadian (New) 1828
Asian Fusion 1775
Mediterranean 1741
Steakhouses 1522
Indian 1409
Thai 1388
Vietnamese 1225
Middle Eastern 1182


Data Clustering

We need to set the parameters for clustering restaurants with the DBSCAN algorithm:

Define parameters for the DBSCAN clustering algorithm

  1. epsilon: [ 100 meters ] We set 100 meters as the distance limit for a neighboring business to be included within a particular cluster. This means that, as long as there are businesses within 100 meters of each other, they will keep being included in the same cluster.
  2. min_neighbors: [ 4 ] The least number of businesses within 100 meters of one another needed to declare a cluster. We will eliminate clusters with fewer businesses than the min_neighbors threshold to reduce noise.

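Because DBSCAN's haversine metric works in radians, a distance threshold in meters must be divided by Earth's radius, and the coordinates converted with np.radians. A toy check (made-up coordinates, not from the dataset): four points within ~70 m of each other form one cluster, while a point ~1 km away is marked as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088
eps_meters = 100
epsilon = eps_meters / 1000 / kms_per_radian  # meters -> radians

# four points ~22-67 m apart along one street, plus one ~1 km away
coords = np.array([
    [43.6500, -79.3800],
    [43.6502, -79.3800],
    [43.6504, -79.3800],
    [43.6506, -79.3800],
    [43.6600, -79.3800],  # far outlier
])

db = DBSCAN(eps=epsilon, min_samples=4, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coords))
print(db.labels_)  # → [ 0  0  0  0 -1]
```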
In [95]:
kms_per_radian = 6371.0088
epsilon = 0.5 / kms_per_radian
min_neighbors = 4
In [5]:
kms_per_radian = 6371.0088
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_population_size_compare = pd.DataFrame(0, index=range(0,255), \
                                          columns=['Minimum Neighbors','Epsilon(m)','Coverage','Count'])
In [800]:
start_mn = 3
end_mn = 20
start_eps = 50
end_eps = 1500

start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
    for e in range(start_eps, end_eps+50, 50):
        eps = e/1000/kms_per_radian
        dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
        .fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
        cluster_coverage = sum(dbscn.labels_ >= mn)
        cluster_count = sum(np.unique(dbscn.labels_) >= mn)
        compression = 100*(1 - float(cluster_count) / cluster_coverage)
        df_population_size_compare.at[indx, 'Minimum Neighbors'] = mn
        df_population_size_compare.at[indx, 'Epsilon(m)'] = e
        df_population_size_compare.at[indx, 'Coverage'] = cluster_coverage
        df_population_size_compare.at[indx, 'Count'] = cluster_count
        df_population_size_compare['Compression'] = compression
        indx = indx + 1
        print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))
        
df_population_size_compare.head()
Out[800]:
Minimum Neighbors Epsilon(m) Coverage Count Compression
0 3.0 50.0 29645.0 4808.0 83.781413
1 3.0 100.0 41927.0 4838.0 88.460896
2 3.0 150.0 47228.0 4174.0 91.162023
3 3.0 200.0 49443.0 3649.0 92.619784
4 3.0 250.0 48489.0 3294.0 93.206707

Cluster data using the DBSCAN algorithm

In [7]:
df_population_size_compare.to_pickle('df_population_size_compare.pkl')
In [182]:
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')
df_population_size_compare['Compression'] = 100 * (1 - df_population_size_compare['Count']/df_population_size_compare['Coverage'])
df_population_size_compare.head()
Out[182]:
Minimum Neighbors Epsilon(m) Coverage Count Compression
0 3.0 50.0 29645.0 4808.0 83.781413
1 3.0 100.0 41927.0 4838.0 88.460896
2 3.0 150.0 47228.0 4174.0 91.162023
3 3.0 200.0 49443.0 3649.0 92.619784
4 3.0 250.0 48489.0 3294.0 93.206707
In [795]:
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Coverage'

x_start = 3
x_end = 20
max_x = []
max_y = []
max_z = []

max_2x = []
max_2y = []
max_2z = []

for i in range(x_start, x_end+1):
    # figure out the peak values line
    df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
    max_row = df[df[z_col] == df[z_col].max()]
    max_x.append(max_row[x_col].values[0])
    max_y.append(max_row[y_col].values[0])
    max_z.append(max_row[z_col].values[0])
    
    # find the second peak line
    df = df_population_size_compare[df_population_size_compare[x_col]==i].reset_index(drop=True)
    peak2df = df[df[y_col] <= 300]
    max_2row = peak2df[peak2df[z_col] == peak2df[z_col].max()]
    max_2x.append(max_2row[x_col].values[0])
    max_2y.append(max_2row[y_col].values[0])
    max_2z.append(max_2row[z_col].values[0])
    
x = df_population_size_compare[x_col].values
y = df_population_size_compare[y_col].values
z = df_population_size_compare[z_col].values
    
traces = []
traces.append(go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=6,
        color=z,
        colorscale='Jet',   
        opacity=0.8
    ),
    showlegend=True, 
    name='Coverage'
))
# draw max line for z values
traces.append(go.Scatter3d(
    z=max_z,
    y=max_y,
    x=max_x,
    line=dict(
        color='teal',
        width = 4
    ),
    mode='lines',
    name='Max Counts Line'
))
# draw 2nd peak line for z values
traces.append(go.Scatter3d(
    z=max_2z,
    y=max_2y,
    x=max_2x,
    line=dict(
        color='purple',
        width = 4
    ),
    mode='lines',
    name='2nd Max Counts Line'
))


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=50,
        t=50
    ),
    
    paper_bgcolor='#999999',
    title='Clustered Points Coverage vs. Minimum Neighbors & Distance (meters)',
    scene=dict(
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=-.25),
            eye=dict(x=1.25, y=1.25, z=1.25)
        ),
        xaxis=dict( title= x_col),
        yaxis=dict( title= y_col),
        zaxis=dict( title= z_col) 
    ),
    font= dict(color='#ffffff')
)

fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_coverage">DBSCAN Min. Neighbors & Distance vs Coverage</a>'))
iplot(fig, filename='3d-scatter-colorscale')

From the 3D scatter heat plot above, we can observe that the cluster Coverage (total number of points clustered) is inversely proportional to the Minimum Neighbors count, maximized at mn = x = 3, and that it has two peaks along the Maximum Distance Epsilon(e) axis: the first at 550 meters, the second between 50 and 350 meters for Minimum Neighbors values below 6.

To further narrow down the ideal parameters, we will look at a ribbon plot of Minimum Neighbors (X-axis) and Maximum Distance Epsilon (Y-axis) against the number of resulting clusters (Count). The aim is to narrow down to a range where the cluster count is maximized.

In [796]:
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Count'


x_start = 3
x_end = 20
y_start = 0
y_end = 30
traces = []
max_x = []
max_y = []
max_z = []
for i in range(x_start, x_end+1):
    x = []
    y = []
    z = []
    ci = int(255/18*i) # "color index"
    df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
    max_row = df[df[z_col] == df[z_col].max()]
    max_x.append(max_row[x_col].values[0])
    max_y.append(max_row[y_col].values[0])
    max_z.append(max_row[z_col].values[0])
    max_x.append(max_row[x_col].values[0] + 0.5)
    max_y.append(max_row[y_col].values[0])
    max_z.append(max_row[z_col].values[0])
    
    for j in range(y_start, y_end):
        x.append([i, i+.5])
        y.append([df.loc[j,y_col], df.loc[j,y_col]])
        z.append([df.loc[j,z_col], df.loc[j,z_col]])
    traces.append(dict(
        z=z,
        x=x,
        y=y,
        colorscale=[ [i, 'rgb(255,%d,%d)'%(ci, ci)] for i in np.arange(0,1.1,0.1) ],
        showscale=False,
        type='surface'
    ))
# draw max line for z values
traces.append(go.Scatter3d(
    z=max_z,
    y=max_y,
    x=max_x,
    line=dict(
        color='green',
        width = 8
    ),
    mode='lines',
    name='Max Counts Line'
))


layout = go.Layout(
    autosize=True,
    height=500,
    margin=go.layout.Margin(
        l=0,
        r=0,
        b=0,
        t=50,
        pad=0
    ),
    paper_bgcolor='#999999',
    title='Clustered Ribbons of Cluster Count vs. Minimum Neighbors & Distance (meters)',
    scene=dict(
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=-.25),
            eye=dict(x=1.5, y=1.5, z=1.5)
        ),
        xaxis=dict( title= x_col),
        yaxis=dict( title= y_col),
        zaxis=dict( title= z_col) 
    ),
    font= dict(color='#ffffff')
)
fig = { 'data':traces, 'layout': layout }
display(HTML('<a id="mn_e_count">DBSCAN Min. Neighbors & Distance vs Cluster Count</a>'))
iplot(fig, filename='ribbon-plot-python')

The ribbon chart above shows that the number of clusters is inversely proportional to the number of Minimum Neighbors, peaking around mn = 3. For Epsilon, the maximum distance to include locations within a cluster, the cluster count peaks between 50 and 350 meters.

We can observe that there is a convergence from both graphs (Coverage & Count) for ranges:

Minimum Neighbors : 3 - 6
Epsilon(e) : 50 - 350 meters

We will investigate only these ranges from here onwards.
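Restricting the grid-search results to these ranges is a simple boolean-mask operation on the comparison DataFrame; a sketch on a toy frame with the same columns (the values are made up):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Minimum Neighbors': [3, 4, 6, 7, 12],
    'Epsilon(m)':        [100, 400, 300, 200, 50],
    'Coverage':          [41927, 48000, 45000, 46000, 30000],
})

# Series.between is inclusive on both ends by default
mask = (df_toy['Minimum Neighbors'].between(3, 6) &
        df_toy['Epsilon(m)'].between(50, 350))
print(df_toy[mask]['Minimum Neighbors'].tolist())  # → [3, 6]
```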

In [358]:
df_population_dist_compare = pd.DataFrame(None, index=range(0,28), \
                                          columns=['Minimum Neighbors','Epsilon(m)','Min','Max', 'Labels'])
In [799]:
start_mn = 3
end_mn = 6
start_eps = 50
end_eps = 350

start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
    for e in range(start_eps, end_eps+50, 50):
        eps = e/1000/kms_per_radian
        dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
        .fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
        
        df = pd.DataFrame(dbscn.labels_, columns=['label'])
        
        df_counts = df.groupby(['label']).size().reset_index(name='count')
        df_counts = df_counts[(df_counts['label'] > -1) & (df_counts['count'] >= mn)]
        
        labels = [x for x in dbscn.labels_ if x != -1] # all labels except -1
        
        df_population_dist_compare.at[indx, 'Minimum Neighbors'] = mn
        df_population_dist_compare.at[indx, 'Epsilon(m)'] = e
        df_population_dist_compare.at[indx, 'Min'] = df_counts['count'].min()
        df_population_dist_compare.at[indx, 'Max'] = df_counts['count'].max()
        df_population_dist_compare.at[indx, 'Labels'] = labels
        
        indx = indx + 1
        print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))
df_population_dist_compare.head()
Out[799]:
Minimum Neighbors Epsilon(m) Min Max Labels
0 3 50 3 235 [0, 1, 2, 2130, 3, 4, 5, 6, 7, 4, 8, 9, 10, 11...
1 3 100 3 1750 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 366...
2 3 150 3 3180 [0, 1, 2, 3, 4, 5, 6, 7, 6, 8, 9, 10, 11, 11, ...
3 3 200 3 3493 [0, 1, 2, 3, 4, 5, 6, 7, 6, 8, 9, 10, 11, 12, ...
4 3 250 3 3903 [0, 1, 2, 3, 4, 5, 6, 1, 7, 1, 8, 9, 10, 5, 11...
In [360]:
df_population_dist_compare.to_pickle('df_population_dist_compare.pkl')
In [797]:
df_population_dist_compare = pd.read_pickle('df_population_dist_compare.pkl')

x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Max'
x = df_population_dist_compare[x_col].values
y = df_population_dist_compare[y_col].values
z = df_population_dist_compare[z_col].values
zmin = df_population_dist_compare[z_col].min()
zmax = df_population_dist_compare[z_col].max()
intensity = (df_population_dist_compare[z_col].values - zmin)/(zmax-zmin)

traces = []
traces.append(
    go.Mesh3d(
        x = x,
        y = y,
        z = z,
        intensity = z,
        opacity=0.6,
        colorscale = 'Earth',
        reversescale=True
    )
)

layout = go.Layout(
    title='Largest Cluster vs. Min Neighbors and Epsilon',
    paper_bgcolor='#999999',
    scene = dict(
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=-.25),
            eye=dict(x=-2, y=-.8, z=0.3)
        ),
        xaxis=dict( title= x_col),
        yaxis=dict( title= y_col),
        zaxis=dict( title= z_col) 
    ),
    font= dict(color='#ffffff')
)
fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_largest_cluster">DBSCAN Min. Neighbors & Distance vs Largest Cluster Size</a>'))
iplot(fig, filename='max-3d-mesh')
In [792]:
numCols = 4
fig = tools.make_subplots(rows=7, cols=4)

idx = 0
for index, row in df_population_dist_compare.iterrows():

    trace = go.Histogram(
        x = row['Labels'],
        name = "mn:{}<br>e:{}" \
        .format(row['Minimum Neighbors'], row['Epsilon(m)'])
    )
    i,j = idx // numCols + 1, idx % numCols + 1
    fig.append_trace(trace, i, j)
    idx = idx + 1
    fig['layout']['xaxis' + str(idx)]['tickformat'] = 's'
    fig['layout']['yaxis' + str(idx)]['tickformat'] = 's'
fig['layout']['paper_bgcolor'] = '#999999'
fig['layout']['font']['color'] = '#ffffff'
fig['layout']['font']['size'] = 9
fig['layout']['xaxis']['tickformat'] = 's'
fig['layout']['yaxis' + str(idx)]['tickformat'] = 's'
display(HTML('<a id="mn_e_histograms">DBSCAN Label Count Histogram for Min. Neighbors & Distance</a>'))
iplot(fig, filename='binning function')
This is the format of your plot grid:
[ (1,1) x1,y1 ]    [ (1,2) x2,y2 ]    [ (1,3) x3,y3 ]    [ (1,4) x4,y4 ]  
[ (2,1) x5,y5 ]    [ (2,2) x6,y6 ]    [ (2,3) x7,y7 ]    [ (2,4) x8,y8 ]  
[ (3,1) x9,y9 ]    [ (3,2) x10,y10 ]  [ (3,3) x11,y11 ]  [ (3,4) x12,y12 ]
[ (4,1) x13,y13 ]  [ (4,2) x14,y14 ]  [ (4,3) x15,y15 ]  [ (4,4) x16,y16 ]
[ (5,1) x17,y17 ]  [ (5,2) x18,y18 ]  [ (5,3) x19,y19 ]  [ (5,4) x20,y20 ]
[ (6,1) x21,y21 ]  [ (6,2) x22,y22 ]  [ (6,3) x23,y23 ]  [ (6,4) x24,y24 ]
[ (7,1) x25,y25 ]  [ (7,2) x26,y26 ]  [ (7,3) x27,y27 ]  [ (7,4) x28,y28 ]

Based on the histograms drawn above, the teal histogram with a minimum neighbor distance of 100 meters and an epsilon (e) value of 4 is our parameter set of choice, for the following reasons:

  1. In the Cluster Count ribbon graph it lies on the maximum curve, so it yields the highest number of clusters for a minimum of 4 neighbors.
  2. In the Coverage scatter graph it sits well above mn=100, e=5 and the rest of the values, below only the outliers (which would potentially include noise).
  3. It lies in the lower (earth-toned) range of the surface graph, indicating that the maximum business count within a single cluster is kept small.
  4. Its histogram is the least skewed among the e=4 settings, meaning its clusters are more evenly distributed than those of higher e values.
  5. We do not select e=3, even though its histograms are the most evenly distributed, because it would not maximize the number of clusters.
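As a rough numerical companion to point 4, the skew of two cluster-size distributions can be compared directly. This is only a sketch with made-up size samples; `scipy.stats.skew` is assumed available (SciPy ships alongside scikit-learn).

```python
import numpy as np
from scipy.stats import skew

# made-up cluster-size samples standing in for two of the histograms above
sizes_even = np.array([4, 5, 5, 6, 7, 8, 9, 10, 12, 15])    # compact, even clusters
sizes_tail = np.array([4, 4, 5, 5, 6, 8, 20, 45, 90, 300])  # heavy right tail

# a skew closer to 0 means cluster sizes are more evenly distributed
print(skew(sizes_even), skew(sizes_tail))
```

A setting whose size histogram shows low skew spreads restaurants across clusters of similar size, which is what point 4 favors.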

Define parameters for the DBSCAN clustering algorithm

  1. epsilon: [ 100 meters ] We set 100 meters as the distance limit for a neighboring business to be included within a particular cluster. As long as businesses lie within 100 meters of one another, they keep being absorbed into the same cluster.
  2. min_neighbors: [ 4 ] The least number of businesses within 100 meters of one another required to declare a cluster. Clusters with fewer businesses than the min_neighbors threshold are eliminated to reduce noise.
In [375]:
kms_per_radian = 6371.0088  # mean Earth radius in km; converts km to radians
epsilon = 0.1 / kms_per_radian  # 100 m expressed in radians for the haversine metric
min_neighbors = 4
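As a quick check on the unit conversion, the haversine formula can be written out standalone with the mean Earth radius (a sketch; it does not call the clustering code, and the two Toronto test points are made up):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters, using the mean Earth radius (6371.0088 km)."""
    r = 6371008.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Two downtown Toronto points roughly 80 m apart: closer than epsilon (100 m),
# so DBSCAN with the haversine metric would link them into one cluster.
print(haversine_m(43.6543, -79.3860, 43.6550, -79.3862))

print(0.1 / 6371.0088)  # epsilon in radians, matching the eps echoed by DBSCAN below
```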
In [455]:
start_time = time.time()

fd_coordinates = pd.read_pickle('fd_coordinates.pkl')
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')

dbscn = DBSCAN(eps=epsilon, min_samples=min_neighbors, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))

cluster_labels = dbscn.labels_

print(dbscn)

num_clusters = len(set(cluster_labels))

message = ' Total points clustered: {:,} \n Number of clusters: {:,} \n Compression ratio: {:.1f}% \n Time taken: {:,.2f} seconds'
print(message.format(len(fd_coordinates), num_clusters, \
                     100*(1 - float(num_clusters) / len(fd_coordinates)), time.time()-start_time))

fd_cluster_labels = pd.DataFrame(cluster_labels, columns=['label'])
print('Number of labels:{}'.format(len(cluster_labels)))
fd_cluster_labels.to_pickle('fd_cluster_labels.pkl')
fd_cluster_labels.head()

# Join cluster labels with the original dataset of the restaurants
df_restaurants_labeled = df_restaurants_flagged.join(pd.DataFrame(fd_cluster_labels))

# Drop noise (label -1) and clusters smaller than the min_neighbors threshold
df_rst_lbl_grouped = df_restaurants_labeled.groupby(['label']).size().reset_index(name='count')
df_lbl_counts = df_rst_lbl_grouped[(df_rst_lbl_grouped['label'] > -1) \
                                   & (df_rst_lbl_grouped['count'] >= min_neighbors)].set_index('label')

# Remove all restaurants that were not labeled
df_restaurants_label_filtered = df_restaurants_labeled.join(df_lbl_counts, on='label', how='inner')

df_restaurants_labeled.to_pickle('df_restaurants_labeled.pkl')

print(len(df_restaurants_label_filtered))

df_restaurants_label_filtered.to_pickle('df_restaurants_label_filtered.pkl')
df_restaurants_label_filtered.head()
DBSCAN(algorithm='ball_tree', eps=1.5696101377226163e-05, leaf_size=30,
    metric='haversine', metric_params=None, min_samples=4, n_jobs=1,
    p=None)
 Total points clustered: 56,839 
 Number of clusters: 3,298 
 Compression ratio: 94.2% 
 Time taken: 3.51 seconds
Number of labels:56839
36583
Out[455]:
business_id latitude longitude city neighborhood state postal_code stars categories Sandwiches ... Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern label count
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches] 1 ... 0 0 0 0 0 0 0 0 0 10
5343 ps03u_P469lpTqYHOedgUw 45.540476 -73.598844 Montréal Rosemont-La Petite-Patrie QC H2G 1K8 4.0 [] 0 ... 0 0 0 0 0 0 0 0 0 10
6393 GJHTRlN2nUR0gAeMtSrhQQ 45.541066 -73.598421 Montréal Rosemont-La Petite-Patrie QC H2G 1K9 1.5 [Sushi Bars, Japanese] 0 ... 0 0 0 0 0 0 0 0 0 10
10210 qbU7xNsZnxD5Gm9pIG2Xtw 45.541192 -73.598197 Montréal Rosemont-La Petite-Patrie QC H2G 1K9 4.5 [] 0 ... 0 0 0 0 0 0 0 0 0 10
18485 tazvb01hBe6kQAxqtwSlUQ 45.541042 -73.598469 Montréal Rosemont-La Petite-Patrie QC H2G 1K9 5.0 [] 0 ... 0 0 0 0 0 0 0 0 0 10

5 rows × 31 columns



User Data Preparation

Filter reviews data to only include filtered restaurants reviews

In [456]:
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_review_data = pd.read_pickle('df_review_data.pkl')

df_reviews_and_restaurants = df_review_data.join(df_restaurants_label_filtered.set_index('business_id'), \
                                                 on='business_id', how='inner')
print(len(df_reviews_and_restaurants))
df_reviews_and_restaurants.head()
2752070
Out[456]:
business_id review_id user_id latitude longitude city neighborhood state postal_code stars ... Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern label count
0 iCQpiavjjPzJ5_3gPD5Ebg x7mDIiDB3jEiPGPHOmDzyw msQe1u7Z_XuqjGoqhB0J5g 36.109837 -115.174212 Las Vegas The Strip NV 89109 4.0 ... 0 0 0 0 0 0 0 0 88 105
788 iCQpiavjjPzJ5_3gPD5Ebg VluIpojME0yKOcRKI5L0Iw D1_nrBr4dOrs7M82OaBRwQ 36.109837 -115.174212 Las Vegas The Strip NV 89109 4.0 ... 0 0 0 0 0 0 0 0 88 105
3314 iCQpiavjjPzJ5_3gPD5Ebg nsB9JAeghk0zOaSulSm9Yw jEjc8HxeCeI22M4di0Y1aw 36.109837 -115.174212 Las Vegas The Strip NV 89109 4.0 ... 0 0 0 0 0 0 0 0 88 105
6726 iCQpiavjjPzJ5_3gPD5Ebg pcn01EMERpCpHEcoaohdEg RTiaaHSN80VRTLSdNK5rfQ 36.109837 -115.174212 Las Vegas The Strip NV 89109 4.0 ... 0 0 0 0 0 0 0 0 88 105
8969 iCQpiavjjPzJ5_3gPD5Ebg 0mFtAPTmInbXHqDjX9eiOg hBmJs49zYcXdV4wo8F5iiA 36.109837 -115.174212 Las Vegas The Strip NV 89109 4.0 ... 0 0 0 0 0 0 0 0 88 105

5 rows × 33 columns

Group each user's reviews by restaurant category. The higher a user's review count for a certain category, the more likely that user is to visit restaurants of that category.
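As a toy illustration of this grouping (made-up `user_id`s and one-hot category flags, not the project data):

```python
import pandas as pd

# each row is one review; category columns are one-hot flags on the restaurant
reviews = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'Pizza':   [1, 0, 1],
    'Thai':    [0, 1, 0],
})

# summing the flags per user counts how often each user reviews each category
visits = reviews.groupby('user_id').sum()
print(visits.loc['u1', 'Pizza'])  # u1 reviewed 1 Pizza restaurant
```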

In [457]:
df_user_rst_visits = df_reviews_and_restaurants.filter(['user_id'] + top_20_specific_categories , axis=1) \
.groupby(['user_id']).sum()

df_user_rst_visits.to_pickle('df_user_rst_visits.pkl')
df_user_rst_visits.head()
Out[457]:
Sandwiches American (Traditional) Pizza Burgers Italian Mexican Chinese American (New) Japanese Chicken Wings Seafood Sushi Bars Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern
user_id
---1lKK3aKOuomHnwAkAow 7 5 7 3 7 5 0 14 2 0 4 3 0 3 0 4 2 2 0 0
---94vtJ_5o_nikEs6hUjg 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
---PLwSf5gKdIoVnyRHgBA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
---cu1hq55BP9DWVXXKHZg 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
---udAKDsn0yQXmzbWQNSw 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

Restaurant/Review Count Ratio: The more users review restaurants of a particular category, the more interested they are in eating that kind of food. The overall review count of a restaurant category therefore indicates the interest of users in that category of food and restaurants.

Across the entire population, an equilibrium should exist between the review count indicating desire for a particular food type (let's call it the Demand Indicator) and the number of reviewed restaurants of that category catering to that demand (the Supply Indicator).

We can calculate the ratio of restaurant count to review count for each category to find the rate at which user interest translates into restaurants of that category in the overall population.
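A worked example with the Sandwiches figures from the tables below: multiplying the ratio back by the review count recovers the approximate per-category restaurant count used as the supply indicator.

```python
reviews = 287774  # Sandwiches demand indicator (total review count, from the table below)
ratio = 0.024012  # Sandwiches supply/demand (restaurant/review) ratio, from the table below

# ratio = restaurants / reviews, so restaurants = ratio * reviews
restaurants = reviews * ratio
print(round(restaurants))  # roughly 6,910 Sandwiches restaurants in the population
```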

In [458]:
df_demand_indicator_by_category = df_user_rst_visits.sum()
df_demand_indicator_by_category.to_frame('Demand (Review Count)')
Out[458]:
Demand (Review Count)
Sandwiches 287774
American (Traditional) 429317
Pizza 232112
Burgers 235403
Italian 239468
Mexican 239800
Chinese 174176
American (New) 465398
Japanese 223735
Chicken Wings 66181
Seafood 220247
Sushi Bars 177744
Canadian (New) 68030
Asian Fusion 167424
Mediterranean 94187
Steakhouses 160742
Indian 49903
Thai 95392
Vietnamese 63267
Middle Eastern 42977
In [459]:
# df_supply_indicator_by_category (per-category restaurant counts) was computed earlier
review_restaurant_ratio = df_supply_indicator_by_category/df_demand_indicator_by_category
df_restaurant_review_ratio = review_restaurant_ratio.to_frame('Supply/Demand (Restaurant/Review) Ratio')
df_restaurant_review_ratio.to_pickle('df_restaurant_review_ratio.pkl')
df_restaurant_review_ratio
Out[459]:
Supply/Demand (Restaurant/Review) Ratio
Sandwiches 0.024012
American (Traditional) 0.015487
Pizza 0.028340
Burgers 0.021724
Italian 0.018804
Mexican 0.018399
Chinese 0.024314
American (New) 0.009087
Japanese 0.011464
Chicken Wings 0.038334
Seafood 0.010697
Sushi Bars 0.012113
Canadian (New) 0.026870
Asian Fusion 0.010602
Mediterranean 0.018485
Steakhouses 0.009469
Indian 0.028235
Thai 0.014550
Vietnamese 0.019362
Middle Eastern 0.027503

Reload the saved data frames, including the supply/demand ratio indicator for each restaurant category

In [460]:
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_user_rst_visits = pd.read_pickle('df_user_rst_visits.pkl')
df_restaurant_review_ratio = pd.read_pickle('df_restaurant_review_ratio.pkl')
fd_coordinates = pd.read_pickle('fd_coordinates.pkl')
fd_cluster_labels = pd.read_pickle('fd_cluster_labels.pkl')
df_restaurants_labeled = pd.read_pickle('df_restaurants_labeled.pkl')
In [619]:
gb = df_restaurants_label_filtered.groupby(['label'])
df_clust_group_info = pd.DataFrame({'size': gb.size()})
df_bus_reviews = df_reviews_and_restaurants.set_index('business_id')
df_restaurant_review_ratio_tps = df_restaurant_review_ratio.transpose()

start_time = time.time()

# Aggregate per-cluster info: centroid, location strings, review/user counts,
# and per-category supply (restaurant count) and demand (scaled review count)
def get_group_info(cur_cluster):
    groupSize = len(cur_cluster)
    df_clust_group_info.at[cur_cluster.name, 'size'] = groupSize
    df_clust_group_info.at[cur_cluster.name, 'latitude'] = cur_cluster['latitude'].sum()/groupSize
    df_clust_group_info.at[cur_cluster.name, 'longitude'] = cur_cluster['longitude'].sum()/groupSize
    df_clust_group_info.at[cur_cluster.name, 'city'] = pd.Series(cur_cluster['city'].unique()).str.cat(sep=', ')
    df_clust_group_info.at[cur_cluster.name, 'zip'] = pd.Series(cur_cluster['postal_code'].unique()).str.cat(sep=', ')
    df_clust_group_info.at[cur_cluster.name, 'neighborhood'] = pd.Series(cur_cluster['neighborhood'].unique()).str.cat(sep=', ')
    df_cur_cluster_reviews = cur_cluster[['business_id']].join(df_bus_reviews, on='business_id', how='inner')
    df_cur_cluster_unique_users = df_cur_cluster_reviews[['user_id']].drop_duplicates()
    df_clust_user_rst_visits = df_cur_cluster_unique_users.join(df_user_rst_visits, on='user_id')
    df_clust_group_info.at[cur_cluster.name, 'reviews_count'] = len(df_cur_cluster_reviews)
    df_clust_group_info.at[cur_cluster.name, 'user_count'] = len(df_cur_cluster_unique_users)
    for category in top_20_specific_categories:
        df_clust_group_info.at[cur_cluster.name, category + ' Supply'] =  cur_cluster[category].sum()
        df_clust_group_info.at[cur_cluster.name, category + ' Demand'] =  df_clust_user_rst_visits[category].sum() \
        * df_restaurant_review_ratio_tps.loc['Supply/Demand (Restaurant/Review) Ratio',category]
    print('Time taken: {:,.2f} minutes - Group # {}'.format((time.time()-start_time)/60, cur_cluster.name))
gb.apply(get_group_info)
df_clust_group_info.head()
Out[619]:
size latitude longitude city zip neighborhood reviews_count user_count Sandwiches Supply Sandwiches Demand ... Steakhouses Supply Steakhouses Demand Indian Supply Indian Demand Thai Supply Thai Demand Vietnamese Supply Vietnamese Demand Middle Eastern Supply Middle Eastern Demand
label
0 10 45.540408 -73.599135 Montréal H2G 1K7,H2G 1K8,H2G 1K9,H2S 1V1,H2S 1K7,H2S 1T... Rosemont-La Petite-Patrie 144.0 142.0 1.0 5.380714 ... 0.0 0.877739 0.0 1.711929 1.0 0.661825 3.0 1.529594 0.0 1.940887
1 4 43.713006 -79.633204 Mississauga L4T 1A8 Ridgewood 46.0 44.0 0.0 1.284429 ... 0.0 0.256007 0.0 1.196288 1.0 0.709098 0.0 0.551915 0.0 1.160313
2 6 43.862432 -79.306721 Markham,Unionville L3R 1M5 Unionville 334.0 300.0 0.0 12.809571 ... 0.0 2.062687 0.0 5.837059 0.0 7.067342 0.0 11.621763 0.0 4.261512
3 217 43.665082 -79.411405 Toronto M6G 2L7,M6G 1K8,M6G 1K3,M5S 1X2,M5S 1Y6,M5S 1Y... Palmerston,Koreatown,The Annex,,Seaton Village 12525.0 7452.0 12.0 193.896641 ... 0.0 22.418925 5.0 104.860802 7.0 83.295364 5.0 97.058275 12.0 80.504605
4 408 45.519347 -73.578376 Montréal,Montreal H2W 1Z3,H2X 2T6,H2J 2J3,H2W 1J1,H2X 2X7,H2W 1Z... Plateau-Mont-Royal,Ville-Marie 18093.0 10739.0 36.0 207.105426 ... 8.0 29.236034 15.0 66.043330 12.0 40.867674 16.0 57.068058 9.0 63.121011

5 rows × 48 columns

In [487]:
df_clust_group_info.to_pickle('df_clust_group_info.pkl')

For the denser cluster population, 20 size intervals are used to bin the clusters, with a matching set of 20 colors so that each size band is easy to recognize on the map.
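The binning can be sketched as a simple lookup (a hypothetical helper `bin_index`, shown with only the first five bins of the size table defined below):

```python
# first five bins of the 20-interval cluster-size table
limits_20 = [(4, 5), (6, 10), (11, 15), (16, 20), (21, 25)]

def bin_index(size, limits):
    """Return the index of the (inclusive) bin a cluster size falls into."""
    for i, (lo, hi) in enumerate(limits):
        if lo <= size <= hi:
            return i
    return None  # size falls outside every bin

print(bin_index(13, limits_20))  # size 13 lands in the third bin (index 2)
```

The bin index then selects both the marker size and the color for that cluster on the map.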

In [561]:
df_grouped_cluster_data = pd.read_pickle('df_clust_group_info.pkl')

mapbox_access_token = 'pk.eyJ1IjoiZjhheml6IiwiYSI6ImNqb3plOWp6MjA0bXIzcnFxczZ1bjdrbmwifQ.5qd5W4B06UUZc20Jax12OA'

# interval_20 = pd.interval_range(start=4, periods=20, freq=2, closed='both').to_tuples()
limits_20 = [(4,5),(6,10),(11,15),(16,20),(21,25),(26,30),(31,35),(36,40),(41,45),(46,50),(51,60), \
             (61,70),(71,80),(81,100),(101,150),(151,200),(201,300),(301,400),(401,1000),(1001,2000)]

colors_20 = ['RGB(230,25,75)','RGB(60,180,75)','RGB(255,225,25)','RGB(67,99,216)','RGB(245,130,49)', \
             'RGB(145,30,180)','RGB(70,240,240)','RGB(240,50,230)','RGB(188,246,12)','RGB(250,190,190)', \
             'RGB(0,128,128)', 'RGB(230,190,255)','RGB(154,99,36)','RGB(255,250,200)','RGB(170,255,195)', \
             'RGB(255,216,177)','RGB(0,0,117)','RGB(128,128,128)','RGB(128,0,0)','RGB(128,128,0)']

For the city-level map, 10 interval limits and 10 colors indicate each cluster's size.

In [562]:
label_sizes = df_restaurants_label_filtered[['business_id','label']].groupby(['label']).agg(['count'])
label_sizes['business_id']['count'].nlargest(10)
Out[562]:
label
8      1749
31      617
148     458
4       408
6       319
47      269
283     254
293     254
182     230
251     218
Name: count, dtype: int64
In [564]:
#interval_10 = pd.interval_range(start=4, periods=10, freq=4, closed='both').to_tuples()
limits_10 = [(4,10),(11,20),(21,30),(31,40),(41,50),(51,70),(71,100),(101,200),(201,400),(401,2000)]
colors_10 = ['#0000FF', '#008080',  '#FF0000', '#008000', '#808000', '#000080', '#C36900', \
          '#FF00FF', '#800080','#00FF00']
In [675]:
df_grouped_cluster_data['label'] = df_grouped_cluster_data.index
# Truncate long neighborhood/postal-code lists to 50 characters for map hover text
for index,row in df_grouped_cluster_data.iterrows():
    df_grouped_cluster_data.at[index, 'neighborhood'] = (row['neighborhood'][:50] + (row['neighborhood'][:50] and '...'))
    df_grouped_cluster_data.at[index, 'zip'] = (row['zip'][:50] + (row['zip'][:50] and '...'))
In [771]:
clusters = []
scale = 1

for i in range(len(limits_20)):
    lim = limits_20[i]
    df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                      & (df_grouped_cluster_data['size'] <= lim[1]))]
    cluster = dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'City: ' + df_sub['city'] + \
        '<br>Neighborhood(s): ' + df_sub['neighborhood'] + \
        '<br> Zip/Postal Code(s):' + df_sub['zip'],
        sizemode = 'diameter',
        marker = dict( 
            size = [i*scale]*len(df_sub), 
            color = colors_20[i],
            line = dict(width = 2,color = 'black')
        ),
        name = '{0} - {1}'.format(lim[0],lim[1]) )
    clusters.append(cluster)

layout = dict(
        title = 'Yelp Reviewed Restaurants in North America',
        showlegend = True,
        geo = dict(
            scope='north america',
            projection=dict( type='albers usa canada' ),
            resolution= 50,
            lonaxis= {
                'range': [-150, -55]
            },
            lataxis= {
                'range': [30, 50]
            },
            center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',       
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 100)",
            countrycolor="rgba(255, 200, 255)"           
        ),  
    )
    
fig = dict( data=clusters, layout=layout )
display(HTML('<a id="north_america_clustered">North America Clustered Restaurants by Location (All Categories)</a>'))
iplot( fig, validate=False, filename='d3-bubble-map-populations' )
In [777]:
clusters = []
scale = 3

for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                      & (df_grouped_cluster_data['size'] <= lim[1]))]
    cluster = dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'City: ' + df_sub['city'] + \
        '<br>Size: ' + df_sub['size'].astype(str) + \
        '<br>Neighborhood: ' + df_sub['neighborhood'] + \
        '<br>Postal Code:' + df_sub['zip'],
        sizemode = 'diameter',
        marker = dict( 
            size = [i*scale]*len(df_sub),
            color = colors_10[i],
            line = dict(width = 2,color = 'black')
        ),
        name = '{0} - {1}'.format(lim[0],lim[1]) )
    clusters.append(cluster)

layout = dict(
        title = 'Yelp Reviewed Clustered Restaurants in Toronto',
        showlegend = True,
        geo = dict(
            scope='north america',
            projection=dict( type='albers usa canada', scale=500 ),
            resolution= 50,
            lonaxis= {
                'range': [-130, -55]
            },
            lataxis= {
                'range': [30, 50]
            },
                    center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',       
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(120, 120, 120)",
            countrycolor="rgb(255, 255, 255)"           
        ),  
    )
    
fig = dict( data=clusters, layout=layout )

display(HTML('<a id="toronto_clustered">All Clustered Restaurants on Sketch (Toronto)</a>'))

iplot( fig, validate=False, filename='d3-bubble-map-populations' )
In [794]:
clusters = []
scale = 4

for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                    & (df_grouped_cluster_data['size'] <= lim[1]))]
    cluster = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'Cluster #: ' + df_sub['label'].astype(str) + \
        '<br>Size: ' + df_sub['size'].astype(str) + \
        '<br>City: ' + df_sub['city'] + \
        '<br>Neighborhood: ' + df_sub['neighborhood'] + \
        '<br>Postal Code:' + df_sub['zip'],
        mode = 'markers',
        marker = dict( 
            size = [i*scale]*len(df_sub), 
            color = colors_10[i]
        ),
        name = '[{0} - {1}]'.format(lim[0],lim[1]) )
    border = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        mode='markers',
        marker=dict(
            size=[i * scale + 1]*len(df_sub),
            color='black',
            opacity=0.4
        ),
        hoverinfo='none',
        showlegend=False)
    clusters.append(border)
    clusters.append(cluster)
layout = go.Layout(
    title = 'Yelp Reviewed Clustered Restaurants on Toronto Map',
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
        pitch=0,
        zoom=12
    ),
    
)


fig = dict(data=clusters, layout=layout)

display(HTML('<a id="toronto_clustered_map">All Clustered Restaurants on Map (Toronto)</a>'))

iplot(fig, filename='Multiple Mapbox')
In [725]:
demand_cats = [x + ' Demand' for x in top_20_specific_categories]
supply_cats = [x + ' Supply' for x in top_20_specific_categories]
local_demand_cats = [x + ' Local Demand' for x in top_20_specific_categories]
display_cats = [x + ' Display' for x in top_20_specific_categories]


# Add new Local Demand and Display columns for each category
df_grouped_cluster_data[local_demand_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
df_grouped_cluster_data[display_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])

from sklearn.preprocessing import MinMaxScaler  # needed for the display normalization below

scaler = MinMaxScaler(feature_range=(0, 1))
for index,row in df_grouped_cluster_data.iterrows():
    cluster_supply = row[supply_cats].transpose().sum()
    cluster_demand = row[demand_cats].transpose().sum()
    cluster_adjustment_ratio = (cluster_supply / cluster_demand) if cluster_demand > 0 else 0
    for x in top_20_specific_categories:
        localDemand = round(row[x + ' Demand'] * cluster_adjustment_ratio)
        df_grouped_cluster_data.at[index, x + ' Local Demand'] = localDemand
    # apply (n - min)/(max - min) formula to the difference of Local Demand and Supply to normalize display
    diff = row[supply_cats].values - df_grouped_cluster_data.loc[index, local_demand_cats].values
    scaled = scaler.fit_transform(diff.astype('float64').reshape(-1,1))
    for i in range(len(diff)):
        df_grouped_cluster_data.at[index, top_20_specific_categories[i] + ' Display'] = scaled[i]
            

df_grouped_cluster_data[display_cats].head() 
        
Out[725]:
Sandwiches Display American (Traditional) Display Pizza Display Burgers Display Italian Display Mexican Display Chinese Display American (New) Display Japanese Display Chicken Wings Display Seafood Display Sushi Bars Display Canadian (New) Display Asian Fusion Display Mediterranean Display Steakhouses Display Indian Display Thai Display Vietnamese Display Middle Eastern Display
label
0 0.400000 0.400000 0.400000 0.200000 0.200000 0.400000 0.4 0.400000 0.60000 0.400000 0.400000 0.600000 0.00000 0.600000 0.400000 0.400000 0.400000 0.600000 1.000000 0.400000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.00000 1.000000 1.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
2 0.333333 0.666667 0.333333 0.666667 0.333333 0.333333 0.0 0.333333 1.00000 0.333333 0.666667 0.333333 0.00000 0.666667 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333
3 0.571429 0.685714 0.685714 0.571429 0.485714 0.714286 0.2 0.542857 1.00000 0.657143 0.542857 0.885714 0.00000 0.742857 0.657143 0.542857 0.542857 0.628571 0.542857 0.800000
4 0.718750 0.468750 0.687500 0.750000 0.718750 0.843750 0.0 0.531250 0.96875 0.593750 0.750000 0.843750 0.09375 0.843750 1.000000 0.781250 0.812500 0.843750 0.875000 0.625000
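The `(n - min)/(max - min)` display normalization above can be sketched in isolation (made-up supply-minus-demand gaps, written with plain NumPy rather than `MinMaxScaler`):

```python
import numpy as np

# hypothetical per-category (supply - local demand) gaps for one cluster
diff = np.array([2.0, -1.0, 0.5, 3.0, -2.0])

# min-max scale into [0, 1]: (n - min) / (max - min)
scaled = (diff - diff.min()) / (diff.max() - diff.min())
print(scaled)  # 1.0 marks the largest oversupply, 0.0 the largest shortfall
```

On the map these scaled values drive marker opacity, so well-supplied categories stand out while under-supplied ones fade.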
In [793]:
clusters = []
scale = 4
colors = ['maroon', 'purple', 'navy', 'teal', 'olive']
for x in range(0, len(top_20_specific_categories)):
    cat = top_20_specific_categories[x]
    for i in range(len(limits_10)):
        lim = limits_10[i]
        df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                        & (df_grouped_cluster_data['size'] <= lim[1]))]
        
        demandStr, supplyStr = '{} Demand'.format(cat), '{} Supply'.format(cat)
        local_sd_ratio = df_sub[demandStr].max() / df_sub[supplyStr].max()
        
        cluster = go.Scattermapbox(
            lon = df_sub['longitude'],
            lat = df_sub['latitude'],
            text = 'Category: {}'.format(cat) + \
            '<br>Size: ' + df_sub['size'].astype(str) + \
            '<br>City: ' + df_sub['city'] + \
            '<br>Demand: ' + df_sub['{} Local Demand'.format(cat)].astype(str) + \
            '<br>Supply: ' + df_sub['{} Supply'.format(cat)].astype(str) + \
            '<br>Neighborhood: ' + df_sub['neighborhood'] + \
            '<br>Postal Code:' + df_sub['zip'],
            mode = 'markers',
            marker = dict( 
                size = [i*scale]*len(df_sub), 
                color = colors[x % 5],
                opacity = df_sub['{} Display'.format(cat)] 
            ),
            name = '[{0} - {1}]'.format(lim[0],lim[1]) ,
            visible= (False if x > 0 else True)
        )
        clusters.append(cluster)
        
# add border for all clusters
for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                    & (df_grouped_cluster_data['size'] <= lim[1]))]
    border = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        mode='markers',
        marker=dict(
            size=[i * scale + 1]*len(df_sub),
            color='black',
            opacity=0.1
        ),
        hoverinfo='none',
        visible=True,
        showlegend=False)
    clusters.append(border)

        
        
steps = []
trc_count = len(limits_10)  # 10 traces (one per size bin) for each category
category_size = len(top_20_specific_categories)
# base mask: all category traces hidden, the trailing border traces visible
v = [False] * trc_count * category_size + [True] * trc_count

for i in range(0, category_size):
    step = dict(method='restyle',
                args=['visible', v[0:i * trc_count] + [True] * trc_count + v[ (i+1) * trc_count: len(v)]],
                label='{}'.format(top_20_specific_categories[i]))
    steps.append(step)

sliders = [dict(active=0,
                pad={"t": 1},
                steps=steps)]  
        
layout = go.Layout(
    title = 'Yelp Reviewed Restaurants Supply/Demand by Category Slider',
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
        pitch=0,
        zoom=12
    ),
    sliders = sliders
)


fig = dict(data=clusters, layout=layout)



display(HTML('<a id="toronto_clustered_categorized">Slider Controlled Categories Displaying Demand (Toronto)</a>'))

iplot(fig, filename='Multiple Mapbox')
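The slider's per-step visibility masks above can be sketched in isolation (reduced to 3 categories for readability; the notebook uses 20 categories of 10 size-bin traces each, plus 10 shared border traces appended last):

```python
n_categories, trc_count = 3, 10  # categories x traces per category (size bins)

# base mask: every category trace hidden, the trailing border traces visible
base = [False] * trc_count * n_categories + [True] * trc_count

masks = []
for i in range(n_categories):
    # reveal exactly one category's traces, keep the shared borders visible
    m = base[:i * trc_count] + [True] * trc_count + base[(i + 1) * trc_count:]
    masks.append(m)

print(sum(masks[1]))  # 20 visible traces: one category's 10 bins + 10 borders
```

Each slider step passes its mask to Plotly's `restyle` method, so moving the slider swaps which category's supply/demand markers are drawn without rebuilding the figure.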